Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update aggregate_spatial process implementation #268

Draft
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

jzvolensky
Copy link
Contributor

  1. Related issue: aggregate_spatial improvements EOEPCA/openeo-processes-dask#1

We have been investigating the performance of the aggregate_spatial with different datasets. Currently it is not doing very well.

We have discovered that certain methods such as iterate do not scale well at all. The exactextract method seems to be doing much better (see here https://gist.github.com/clausmichele/6d9bba3a82f39c8c91c0cf5d263e1521)

The current idea is that we make the exactextract method the default method. Comment from Michele:

"However, since it doesn't allow callbacks as a function to be applied to the data within the polygon, the code should select a different method if a callback/function has to be applied to the data."

So this will need to be taken into account.

  1. Related issue: aggregate_spatial: crs of data and geometry mismatch undefined openeo-processes#499

The following improvement will be made: The geometries get reprojected to the data CRS and the resulting vector datacube has the CRS of the input data.

This is currently work in progress. Feel free to add some comments or suggestions!
Thanks.

cc @clausmichele

@clausmichele
Copy link
Member

@jzvolensky the tests are failing, please try to fix them.

@jzvolensky
Copy link
Contributor Author

Hi @clausmichele @ValentinaHutter,

So I ran into a bit of trouble trying to rework the aggregate_spatial process.

We have discussed the possibility of adding a context parameter which will allow the user to set which extraction method to use, so in case of xvec iterate or exactextract. Since we know that exactextract does not support all possible combinations of the statistical processes, we have to make some form of a distinction based on what is present in the process graph.

Now before dealing with that I tried to simply implement the context which changes the method. Here we run into an issue.

first we add the option to add context:

def aggregate_spatial(
    data: RasterCube,
    geometries,
    reducer: Union[str, Callable],
    chunk_size: int = 2,
    context: Optional[Dict[str, str]] = None,
) -> VectorCube:

then here is an example of how we can supply the method:

aggregate = s2_datacube.aggregate_spatial(geometries=polys, reducer="mean", context={"method": "exactextract"})

The problem is that in the real world the reducer is provided as this partial function object like this:

REDUCER: functools.partial(<function OpenEOProcessGraph._map_node_to_callable.<locals>.node_callable at 0x7fd3bfdf3250>, parent_callables=[])

Which is fine for the iterate method, however the exactextract does not support this and therefore you fail with

ValueError: functools.partial(<function OpenEOProcessGraph._map_node_to_callable.<locals>.node_callable at 0x7fd3bfdf3250>, parent_callables=[]) is not a valid aggregation.

I tried to dig in the reducer object but I was not able to find any relevant part we could extract and turn into string for exactextract to use as stats. Is there a way to get this information from the object?

Otherwise, we are not able to match the user provider reducer statistic with what is support by exactextract.

Let me know if you have any ideas. Ideally what we want to end up with is to have a list of processes which match the ones in openeo and exactextract such as mean,min,max etc. and based on what is provided we select the method for better performance.

@clausmichele
Copy link
Member

@jzvolensky you can get it in this way:

    _process = partial(
        process_registry["mean"].implementation,
        ignore_nodata=True,
        data=ParameterReference(from_parameter="data"),
    )
    print(_process.__repr__())

outputs

functools.partial(<function mean at 0x7f81a51a1cf0>, ignore_nodata=True, data=ParameterReference(from_parameter='data'))

@jzvolensky
Copy link
Contributor Author

@jzvolensky you can get it in this way:

    _process = partial(
        process_registry["mean"].implementation,
        ignore_nodata=True,
        data=ParameterReference(from_parameter="data"),
    )
    print(_process.__repr__())

outputs

functools.partial(<function mean at 0x7f81a51a1cf0>, ignore_nodata=True, data=ParameterReference(from_parameter='data'))

This is great, and will probably be useful for the test. however in the implementation code it doesnt do anything differently than what I have been doing. Basically in the implementation we need to deconstruct the reducer object but

    print(f"REDUCER: {reducer}")

functools.partial(<function OpenEOProcessGraph._map_node_to_callable.<locals>.node_callable at 0x7fc62e859f30>, parent_callables=[])
    print(f"REDUCER REPR: {reducer.__repr__()}")

functools.partial(<function OpenEOProcessGraph._map_node_to_callable.<locals>.node_callable at 0x7fc62e859f30>, parent_callables=[])

which is unfortunate :(

@clausmichele
Copy link
Member

clausmichele commented Oct 1, 2024

@jzvolensky try with

import inspect
inspect.getsource(reducer.func)

Edit: after several trials I'm also stuck. Probably it's not possible, with the current openeo-pg-parser-networkx code, to reconstruct from a Callable the various functions that are being called in series? @GeraldIr @ValentinaHutter

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants